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Abstract 

In this paper we present a Bayesian image 
zooming/super-resolution algorithm based on a 
patch based representation. We work on a patch 
based model with overlap and employ a Lo- 
cally Linear Embedding (LLE) based approach 
as our data fidelity term in the Bayesian infer- 
ence. The image prior imposes continuity con- 
straints across the overlapping patches. We ap- 
ply an error back-projection technique, with an 
approximate cross bilateral filter. The problem of 
nearest neighbor search is handled by a variant of 
the locality sensitive hashing (LSH) scheme. The 
novelty of our work lies in the speed up achieved 
by the hashing scheme and the robustness and in- 
herent modularity and parallel structure achieved 
by the LLE setup. The ill-posedness of the image 
reconstruction problem is handled by the intro- 
duction of regularization priors which encode the 
knowledge present in vast collections of natural 
images. We present comparative results for both 
run-time as well as visual image quality based 
measurements. 



1 Introduction 

Inferring a high resolution image from a single or a given 
set of spatially coherent low resolution images has been a 
problem which has been studied by numerous researchers 
in the past. Different algorithms have been proposed to this 
end which find implementations in the real world to han- 
dle different scenarios as well as end goals. For instance, 
super-resolution of faces has an entirely different set of lit- 
erature which takes help from the vast field of face iden- 
tification and detection, whereas, super-resolution of natu- 
ral images still remains in the domains of interpolation and 
optimization techniques. Simple applications like increas- 
ing the size of a digital image to complex scenarios like 
identifying suspects from a surveillance video can all be 



modeled as some form of super-resolution problem. Most 
of the techniques proposed till now can be broadly clas- 
sified into two main categories. Iterative time/frequency 
domain methods and learning based techniques. Iterative 
techniques can be traced back to the seminal work by Tsai 
and Huang |9|, which was then modified to the first itera- 
tive implementation by Irani et. al [11 1. Iterative meth- 
ods use a Bayesian framework wherein an initial guess 
about the high resolution frame is refined at each iteration. 
The image prior is usually assumed to be a smoothness 
prior. Iterative methods are inherently context indepen- 
dent and hence a few problems encountered in cases like 
eyes in face enhancement still remain tough problems for 
the iterative methods. Geman and Geman (61, introduced 
an optimization framework wherein a compatibility model 
of high-resolution to low-resolution image features is con- 
structed, and then the inverse reconstruction problem can 
be solved by searching in the compatibility space. Their 
work proposed modeling the unknown image as a Markov 
Random Field (MRF) where neighboring image regions are 
assumed to be Markovian in nature, thereby simplifying the 
inference problem. Given enough training data, searching 
through compatibility space techniques have been shown 
to work wonderfully well. Freeman et al |5|, studied nat- 
ural image super-resolution using Belief Propagation (BP) 
algorithm |[T6ll for inference in MRFs. The results clearly 
show the superiority of the proposed method, but the run 
time for the algorithm remained a bottleneck for the practi- 
cal implementation of their algorithm. 

Another interesting approach was proposed by Chang et. 
al. 1 2 1, wherein they posed learning based super-resolution 
as locally linear manifold estimation problem based on the 
work by Roweis et al. ifTSl . Their method was based on the 
fact that low-resolution image patches can be represented 
by linear combinations of a few local patches which are 
the nearest neighbors of the test patch in the low-resolution 
training set. Corresponding high-resolution patches are 
then linearly combined and the high-resolution patch is 
thereby inferred. The convex set of coefficients learned for 
the low-resolution space are translated as is to the high- 
resolution space. Boundary issues were handled by sim- 



pie averaging over pixel values. Example based super- 
resolution methods have a similar flavor to the region grow- 
ing based texture synthesis technique proposed by Efros 
and Leung |[4|. Texture synthesis can be assumed to be 
a sub-problem to generic super-resolution, which tries to 
exploit the regularities present in pure textures, wherein 
a candidate region is grown based on sampling from the 
space of the nearest neighbors. This works efficiently for 
pure textures, but when the scene consists of greater details 
than one uniform texture, which is characterized by an in- 
herent sense of repeatability, then the manifold based tech- 
nique produces much better results. A similar method has 
been proposed for face enhancement, commonly known as 
hallucination, by Park et. ah 1 15 1 wherein they propose lin- 
earity preserving projections. Manifold based techniques 
have also been used for successful image retrieval as inves- 
tigated by He et. al. O. But for most of these methods 
the nearest-neighbor search still presents a bottleneck for 
practical implementations. 

In recent years, several researchers proposed to avoid 
the running time bottleneck by using approximation tech- 
niques |[TJ[T0l|7l. Locality Sensitive Hashing (LSH), which 
tries to address a similar problem was proposed by Indyk 
et. al. 1 10 |. The key idea is to hash the points using several 
hash functions so as to ensure that, for each function, the 
probability of collision is much higher for objects which 
are close to each other than for those which are far apart. 
One can determine near neighbors by hashing the query 
point and retrieving elements stored in buckets containing 
that point. The algorithm proposed in |7 1 worked in the 
hamming space and such it assumed the data points to be 
in {0, 1}^ space. In this work we extend the work of Datar 
et. al. O. Their work introduced the idea of distri- 
butions to generate the hash functions and showed that this 
technique can be implemented directly on the data space 
without transformation to the hamming space. 

2 Motivation 

In this paper we present a method which builds on the al- 
gorithm proposed by Chang et. al. 0, with a few key dif- 
ferences which render this work unique and the first of its 
kind to the best of our knowledge. The nearest neighbor 
estimation step mentioned in the original work by Roweis 
et. al. ifTSl , has been used as is in the work by Chang et. 
al. Q. This technique presents the most crucial bottleneck 
in the scalability as well practical use of their algorithm. 
We counter this deficiency by employing Locality Sensi- 
tive Hashing scheme which works well in scenarios where 
nearest neighbor (NN) problem can be approximated by 
an (i?;c)-NN problem |3|, defined later in Sec 



approximation of the bilateral filter which has been shown 
to preserve edges while simultaneously suppressing noise. 

The rest of the paper is organized as follows. We introduce 



the notations in Sec 2.1 which would be consistently used 



3.2 



For 

the method proposed by Chang et. al. |2|, the image re- 
construction quality quickly deteriorates due to the lack 
of effective regularization. We counter this by a back- 
propagation technique, where we introduce an optimized 



for the rest of this paper. In Sec |3.1[ Sec |3.2| and Sec |3.3| we 
introduce the mathematical details for the three main tech- 
niques which form the crux of our work. Sec |4] illustrates 
the main body of this work. We enlist the modifications and 
enhancements made to each of the techniques mentioned 
earlier, to make them work for our application. We present 
the experimental details in Sec [5] Finally, we conclude our 
work with discussions about ongoing and future work in 
SeclH 

2.1 Notations 

Let {Xi,X2 . . . ,Xt} G X denote the set of high- 
resolution images, from which we generate the correspond- 
ing set of low-resolution images {Yi, I2 • • • , ^t} ^ Y by 
first blurring by a Gaussian kernel /C(<j^), where de- 
notes the spread of the kernel, and then down-sampling by 
a factor P = 2, when not specified explicitly. The low- 
resolution images are then interpolated back to the original 
size of the high-resolution images, so that corresponding 
images are of the same size. Individual low and high reso- 
lution image pairs are then divided into overlapping patches 
denoted by xj^yj, where i G {1,2,. ..,T} denotes the im- 
age index, and j G {1, 2, . . . , A^^} denotes the patches in 
the i^^ low/high image pair in the training data. We will 
drop the image index in the later sections when it is clear 
that only patch index is required to identify the low/high 
resolution patch pairs. The patches are numbered in row- 
principle order for each image. Note, that the patch index 
runs through all the images in the training set assigning 
unique integer ID's to each patch from each image in the 
training data. The notation Np specifically denotes the total 
number of patches generated for the entire data set. Also 
note that there are Np high resolution patches and Np low- 
resolution patches which need to be explicitly stored onto 
the disk. 



3 Background 

3.1 Locally Linear Embedding (LLE) 

LLE as a manifold learning technique was introduced by 
Roweis et. al. |18|, wherein they proposed a local neigh- 
borhood based linear estimation technique to unfold com- 
plex manifolds. The key idea is to compute locally lin- 
ear weights to approximate the actual high-dimensional 
space. This method has found many applications in the 
fields of machine learning and computer vision, and Chang 
et. al. f2l were one of the first group of researchers to pro- 
pose an LLE based algorithm for image super-resolution. 



Suppose there are N points in a high-dimensional space 
of dimension D. The N points are assumed to He on a 
manifold of dimension d, where d <^ D. The principle as- 
sumption of this technique is that if the manifold is smooth, 
meaning the local changes in the topology can be assumed 
to be linear, then a sample on the manifold can be approxi- 
mated by a linear combination of a few of its neighbors. As 
such the two main components of the LLE algorithm are 

1. For each data point in the D-dimensional data space: 

(a) Find the set of K nearest neighbors in the same 
space. 

(b) Compute the reconstruction weights of the neigh- 
bors that minimize the reconstruction error. 

2. Compute the low-dimensional embedding in the d di- 
mensional embedding space such that it best preserves 
the local geometry represented by the reconstruction 
weights. 

3.2 Locality Sensitive Hashing (LSH) 

Nearest neighbor problem can be relaxed by posing it as 
an {R; c)-NN problem, where the aim is to return a point 
p which is at a distance cR from the query point g, when 
there is a point in the data set P within distance R from 
q. LSH technique was introduced by Indyk et. al. 1 10 |, to 
solve the {R; c)-NN problem . For a domain S of the points 
set with distance measure D, an LSH family is defined as: 

Definition 1 A family M = {h : S ^ U} is called 
(ri; r2]Pi]P2)— sensitive for D if for any v, q G S 

ifv e B{q; ri) then PrM[h{q) = h{v)] > pi 
ifv ^ B{q; r2) then Pru[h{q) = h{v)] < p2. 

where B{q; ri) is the ball of radius ri centered at q and 
Pi > P2- 

k such hash functions are concatenated to form a new 
function class G = {g : S ^ U^} such that g{v) = 
{hi{v), h2{v)^ . . . ^ hk{v)), where hi e M. L such func- 
tions {gi, g2^ • ' • ^ 9l) are chosen independently uniformly 
at random and form the buckets into which the train- 
ing patches are hashed. During testing the test patches 
are hashed into the buckets and all the matching training 
patches are extracted as the nearest neighbors. 

3.3 Bilateral filter 

Bilateral filtering |19|, is a non-linear filtering technique 
which combines image information from spatial domain as 
well as the feature domain. It can be represented by the 
following equation 
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where I and h are the input and the output images respec- 
tively, X and y are pixel locations, and T{x) is the neigh- 
borhood around the pixel x. 

k{x)= ^ c{x,y)s{l{x),l{y)) (2) 

yerix) 

is a normalization term at pixel x. Normally the function 
c(.) is defined as 



c{x,y) = exp- 
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and similarly the filter s{.) is defined as 



s{x,y) = exp ■ 
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4 Bayesian Inference with back-propagation 

Denote the input low resolution (LR) image as and the 
true high resolution (HR) image as Ih- Now we can write 



p{Ih\Il) ocp{Il\Ih)p{Ih) 



(5) 



The maximum likelihood estimate then gives an estimate 
for the best Ih- the error image which is the deviation 
from the current inference to the observed image is then 
computed as 



(6) 



Here is a smoothing kernel and ^ represents the convolu- 
tion in spatial domain. Back-propagation technique, intro- 
duced by Irani et. al. ifTTIl . adds a smoothed version of the 
error image to the current estimate of the HR image. 
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p is the back-projection filter and can be similar to g. We 
substitute a bilateral filter kernel for the back-projection fil- 
ter as mentioned in the forthcoming sections. 

For a patch based representation, assuming Gaussian noise 
model, the negative log likelihood term can be written as 



N 
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(8) 



where N is the total number of patches of size p x pin the 
image and D{y^ x) is a metric discussed in detail in the later 
sections. We also assume that the sample rate discrepancies 
have been absorbed into D{.). The image prior is assumed 
to be a smoothness constraint imposed on the overlapping 
boundary of the patches. For two neighboring patches in 
the HR image the image prior is denoted as 



yerix) 



log p{Ih) = 

neighbor s{p,q) 



'-H 



||2 

H Woverlap 



2al 



(9) 



Now let us look at the likelihood term in greater detail. For 
each patch in the LR image, we find the closest patches 
in the LR dataset and compute the LLE reconstruction 
weights. Then, we impose the same reconstruction weights 
on the corresponding HR patches and reconstruct the HR 
patch. The search in the LR space is usually the speed bot- 
tleneck for such methods. We implement a variant of the 
LSH technique to alleviate this problem. 

4.1 Hashing 

Datar et. al |3|, showed that Gaussian distribution is a 
p-stable distribution under I2 norm, and hence preserved 
distances after projection by a Gaussian random vector. It 
has been proved that projection by Cauchy random vectors 
preserve li distances II2TII . Recent research work on the 
statistics of natural images (IllllOl, has shown that natural 
images have sparse derivatives, which follow an approx- 
imate Laplacian distribution. This hints at an li type of 
distance metric for the nearest neighbor searching. Based 
on this knowledge we chose the following formulation for 
our hash function 
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where r is an integer chosen to balance the hash bins, {3 
is a uniform random number chosen from [0, r] and a is a 
Cauchy random vector whose elements are sampled from 
the rule 



a - A/'(0,1) 
h - A/'(0,l)5.t. 

ai — a/h 
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where A/'(0, 1) represents a normal distribution, v is some 
feature representation of a low-resolution patch yj and [x\ 
denotes the integer less than or equal to x. We concatenate 
k = 3 such hash functions and form 1 hash table. Based on 
the k integers we form the bins in the table and store the in- 
teger ID corresponding to the patch yj into the hash table. 
L = 30 such hash tables are formed for matching the test 
patches against the training patches. We introduce a hash 
match criterion during matching which states that a match- 
ing patch ID found in one hash table should be present in 
at least t out of L hash tables. 

Feature representation: The low-resolution patches are 
represented as gradient based features for hashing. We 
use different feature representations ranging from first or- 
der gradient (convolution with [—1,0,1], [—1,0,1]^) to 
second order gradient (convolution with [—1,0, 2, 0,-1], 
[-1,0,2,0,-1]^). 

Hashing Quahty (determining r): Entropy has been one 
of the most widely accepted metrics for measuring the qual- 
ity of hashing. In our case entropy directly translates to 



the speed of generating the list of all the approximate near- 
est neighbors. But improving the entropy can sometimes 
lead to poor image reconstruction quality, which hints at a 
tradeoff for practical implementations. The denominator r 
in Eqn[TO| is used to control the quality of hashing, 

4.2 Locally Linear Embedding 

Let us denote the test patch as y^. We query for the test 
patch into each hash table and collect all the y'^ matches 
found, where m G 1,2,.. .,M. We purposefully drop the 
image index as introduced in Sec |2.1[ because after the 
training stage all patches are treated equally, independent 
of their originating images. If M is larger than some pre- 
defined number of matches (typically 3 x L) further search- 
ing is stopped. Once the matches have been identified, we 
solve the following constrained least square minimization 
problem 
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(13) 



Once, the weight matrix is computed the high-resolution 
patch corresponding to y^ can simply be written as 



M 
J = l 



(14) 



For, all the results shown later in the paper, we have used 
5x5 patches with one pixel width overlap. Going back to 
the metric D{.) introduced in Eqn.jsj the metric is equiva- 
lent to solving the LLE reconstruction equation in Eqn. 12 
based on the neighbors found by the hashing scheme. 

4.3 Approximate Bilateral Error Propagation 

Normally, the filter s{.) is defined as in Eqn.|4] But in this 
paper, we present a different formulation for this filter. The 
main concern with numerous researchers with the simple 
form of bilateral filter is the fact that, the complexity of the 
filter grows as O(r^). In view of the kernel size bottleneck, 
we define the filter s{.) as 



s(I(x), r(x)) = exp - 



|I(^)-Iri)]E,.r(.) 1(^)11^ 



(15) 

where the notation [.] just denotes the cardinality of the 
neighborhood We also borrow from Paris et. al. L14 L and 
adopt the idea of cross bilateral filtering wherein, I{x) and 
I{y) can come from different images. Since this filter oper- 
ates on the error image, we replace I{y) terms with le(^), 
where Ig represents the error between the original LR im- 
age and bi-cubic interpolated HR image. The new formula- 
tion of the range filter s{x) can now be exploited, since the 



mean over the filter kernel, can be computed extremely ef- 
ficiently. Again, borrowing from Porikli 1 17], we adopt the 
integral image technique, which renders the evaluation of 
the filter s{.) in Eqn.[4j an 0(1) operation. The new filter 
can now be written as 

\ tit \ rt ^^ ^ye^(.)I(^M^.^) .... 

h{x) = s{l{x),T{x)) -TT~\ 

k[x) 

Assuming a normalized filter C, and denoting the filter re- 
sponse for for the entire image as I5, we can write the 
filter response succinctly as 

H = I, (17) 

where (g) means an elementwise multiplication. 

5 Experiments and Results 

5.1 Training Data Generation 

One of the key benefits of our method over others is the 
fact that it can handle relatively large amounts of training 
data. Because of the approximate nearest neighbor search, 
facilitated by the LSH formulation, the inference algorithm 
only looks at a sub-part of the training data. To gener- 
ate the training data for the experiments shown in Fig |4] 
we collected 500 high resolution images from |[T2l . We 
compute the horizontal and vertical first order gradient at 
each pixel location of an image and then generated a cu- 
mulative distribution function based on the gradient values. 
We sampled for the center pixel location and then saved a 
100 X 100 image centered at this pixel. 25 such samples 
were drawn at random from each of the 500 images chosen 
form [12] and then saved as training images. Each train- 
ing image is then blurred and downsampled to form the 
low-resolution counterpart for the high resolution training 
images. For a patch size of 5 x 5, with an overlap of 2 pix- 
els, we collect a total of around 800 patch pairs from one 
image. The total number of training patch pairs is of the 
order of 800 x 25 x 500 ~ 10^ patch pairs. Examples of 
the sub-images collected from one image is shown in Fig[T] 
For all the results none of the test images were present in 
the training set. For Fig [2] we added a small part of the 
image in the training set. 

5.2 Comparative Results 

We present the comparative results for our algorithm 
against standard image interpolation techniques like bicu- 
bic interpolation. We also compare our results against sim- 
ilar learning based techniques like Chang et. al | 2| and 
Freeman's method 0. We present comparative run-time 
analysis against Chang fl\, where we count one nearest 
neighbor as one operation. Our analysis clearly proves 
the time efficiency achieved by incorporating the hashing 
scheme into the linear embedding technique used in |i2J. 




Figure 1 : Generation of training data. Left: original image, 
right: gradient based sampling of smaller sub-images. 



The first set of results were performed on the little girl im- 
age which was presented in |5 | and has since been used 
by numerous researchers for comparison. Fig [2] presents 
the comparative results of our method against |5|. Note 
that Fig 2(c) has been taken from the author's websit^ 
One primary benefit of our method over | 5 1 is the fact that 
our method goes through all the patches in the test image 
just once, compared to an iterative belief propagation rou- 
tine in 1 5 1, and hence can handle large amount of training 
data, while still being able to generate acceptable results. 
40,000 training patch pairs were reported in | 5 1 and they 
run the process for upto 10 iterations. The run time for their 
method on our system (Pentium 4, 3Mhz, 2MB RAM) was 
of the order of 6 minutes for one iteration, which results to 
around 60 minutes runtime for an image which is of size 
280 X 280. Our method runs in one iteration and takes 
around 1 minute for generating the results. Note that our 
training set for this experiment contained around 300,000 
patch pairs, which is roughly 3 times that handled by [5J. 



We compare our run times against Chang et. al. as pre- 
sented in Fig|3] As hinted earlier, we denote each similarity 
against a patch in the training data base as one time unit and 
count the number of similarity checks we need to perform 
to generate the results. Since comparing against the train- 
ing patches is the most important part of the algorithm (and 
also the most time consuming one), this can be treated as a 
fair comparison for the run times of the two algorithms. We 
compute mean number of similar patches generated by our 
algorithm to infer all the patches of the test image. For L2J 
this number remains fixed at the total number of patches in 
the training set. Fig [3] (top) shows the MSE for 9 differ- 
ent experiments averaged 20 times each. The two curves 
are almost similar strengthening our claim that the approx- 
imate nearest neighbor hashing scheme performs almost at 
par with the complete nearest neighbor scheme. On aver- 
age we underperform by a factor of about 0.025 (for images 
which are all larger than 200 x 200 pixels)which is negligi- 
ble compared to the time gain which we accomplish by our 
method. Fig |3] (bottom) shows the comparison of the qual- 
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Figure 2: 4X magnification for the girl image. 

ity factor Q for the two methods. Note that the quality com- 
parison results are shown in log scale to reduce the wide 
gain achieved by our method over Chang's Q method, and 
plot both the curves in the same figure. As mentioned ear- 
lier, we learn 5 hash tables and hence deal with 5 times the 
training data, but still manage to generate improvements in 
run time by orders of magnitude. The maximum improve- 
ment in runtime is around 2 on the log scale which is about 
100 times. The average time gain is about 70 times over 
Chang's f2l method. The run times reported in this paper 
are all generated on a single machine. The structure of the 
algorithm can be exploited to break the problem into simi- 
lar sub-problems and can be run on different machines. 

Next set of results compare our method against bicubic in- 
terpolation as shown in Fig |4] Since, bicubic interpola- 
tion by itself cannot match our results, we perform an ad- 
ditional unsharp operation and report the comparative re- 
sults in Fig |4] We also perform similar experiments for 
de-blurring and zooming. Fig [5] shows input blurred image 
and the result of our algorithm along with the original HR 
image. Fig |6] shows the input image stretched two times 
and the result of our algorithm along with the original HR 
image. 

6 Conclusion and Future work 

In this paper we present a truly large scale system able to 
handle large amounts of data in acceptable time for image 
zoomiong/super-resolution. We introduce the idea of LSH 
for feature based image patch hashing for efficient approx- 
imate nearest neighbor identification in close to constant 



Figure 3: Comparison of our method against Chang et. 
ah 1 2 1, Top: Mean square error. Bottom: run time in log 
scale. Note that the best result (image 3) points to a time 
saving by a factor of 100. 

time. An LLE based scheme, is then employed to infer the 
high-resolution patches. We present image reconstruction 
results as well as run-time comparisons against established 
learning based methods. 
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